JP-3741: Faster temporary file I/O for outlier detection in on-disk mode #8782

Merged (51 commits) on Sep 27, 2024

Conversation

@emolter (Collaborator) commented Sep 12, 2024

Resolves JP-3741
Resolves JP-3762
Resolves JP-3764

Closes #8774
Closes #8829
Closes #8834

This PR has expanded to become a relatively major refactor of the outlier detection step. It achieves the following:

  1. Makes the outlier detection step run faster when in_memory=False by reducing the volume of data read from disk during the median calculation. Prior to this PR, the get_sections iterator loaded the entirety of each resampled datamodel every time it extracted a single spatial section, causing a large amount of unnecessary file I/O: with, say, 30 resampled images of 100 MB each and a median computed over 50 sections, get_sections read 30x50x100 MB of data to perform an operation that required only 30x100 MB as input. With this PR, each model is read only once (instead of n_sections times) by storing each section in its own appendable on-disk array. For the example above, 50 on-disk arrays are instantiated at the start of the operation, and write_sections loads each resampled datamodel exactly once, appending one spatial section of that datamodel to each of the on-disk arrays until the whole datamodel is distributed across the 50 time stacks. Each of the 50 on-disk arrays has final shape (n_images, section_nrows, section_ncols). Each array is then read back one-by-one, and the median for that section is computed and stored. The requisite 30x100 MB of temporary storage is written only once and read only once, and the on-disk arrays are deleted at the end of processing. A runtime comparison can be seen on the JIRA ticket: the amount of time spent on file I/O decreased by a factor of ~100 for my test dataset. A minimal sketch of this pattern follows the list.
  2. Refactors the resampling in outlier detection so that each drizzled_model (one per group) is written into the OnDiskMedian as soon as it is computed. This avoids saving it in a "median-unfriendly" layout and then immediately reloading it just to re-save it into a more "median-friendly" file structure. The refactor also makes it unnecessary to keep all the drizzled models in memory at once when in_memory=True: now only the data extension needs to be kept for the median computation.
  3. Leaves the behavior of the resample_many_to_many function unchanged. Nadia, Mihai, Brett, Mairan, and I also discussed whether resample_many_to_many needs to be kept at all, given that this PR bypasses it in the outlier detection step. Removing it would mean ResampleStep and ResampleSpecStep could no longer be run in "single" mode, so the change in behavior might extend beyond outlier detection usages. We think it can probably be removed, but that would be outside the scope of this PR.
  4. Adds some memory-saving options to the np.nanmedian computation (one example is shown in the sketch after this list). A memory usage comparison can be seen on the JIRA ticket: peak memory usage with in_memory=True went from 34 GB on main to 15.4 GB on this PR branch for one test association (see the ticket here), and from 21 GB to 8.5 GB for a different one (see the write-up here).
  5. Fixes intermediate file names in spectroscopic modes. Prior to this PR, the _median, _blot, and _outlier_?2d models would be saved to the same filename for all slits in a MultiSlitModel, overwriting each other. This PR adds the slit name to the filename. A separate ticket was made to ensure this will be specifically tested by INS.
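
For readers unfamiliar with the pattern described in items 1 and 4, here is a minimal, self-contained sketch of the idea. It is not the PR's actual implementation: the names SectionStack and build_section_medians are hypothetical, and the use of np.nanmedian(..., overwrite_input=True) is shown only as one example of the kind of memory-saving option item 4 refers to.

```python
# Minimal sketch of the section-append pattern (hypothetical names, not the
# actual jwst implementation).  Each spatial section gets its own on-disk
# array; every resampled image is read once and split across those arrays;
# each array is then read back once to compute its slice of the median.
import tempfile
from pathlib import Path

import numpy as np


class SectionStack:
    """Append-only on-disk stack with final shape (n_images, nrows, ncols)."""

    def __init__(self, n_images, section_shape, filename, dtype=np.float32):
        shape = (n_images,) + tuple(section_shape)
        self._mmap = np.lib.format.open_memmap(
            filename, mode="w+", dtype=dtype, shape=shape)
        self._cursor = 0

    def append(self, section):
        # Write one image's section directly to disk; nothing stays in memory.
        self._mmap[self._cursor] = section
        self._cursor += 1

    def read_and_close(self):
        # Load the full (n_images, nrows, ncols) cube into memory exactly once,
        # then drop the memmap reference so the backing file can be deleted.
        cube = np.array(self._mmap)
        self._mmap = None
        return cube


def build_section_medians(images, n_sections, tempdir=None):
    """Compute a per-pixel nanmedian image while reading each image only once."""
    n_images = len(images)
    nrows, ncols = images[0].shape
    edges = np.linspace(0, nrows, n_sections + 1, dtype=int)
    bounds = list(zip(edges[:-1], edges[1:]))

    median = np.empty((nrows, ncols), dtype=np.float32)
    with tempfile.TemporaryDirectory(dir=tempdir) as tdir:
        stacks = [
            SectionStack(n_images, (hi - lo, ncols), Path(tdir) / f"sect{i}.npy")
            for i, (lo, hi) in enumerate(bounds)
        ]
        # Single pass over the images: each one is read once and its sections
        # are appended to the corresponding on-disk stacks.
        for data in images:
            for (lo, hi), stack in zip(bounds, stacks):
                stack.append(data[lo:hi])
        # Second pass: each stack is read back once and reduced.
        # overwrite_input=True lets nanmedian sort in place rather than
        # allocating another copy of the cube (a memory-saving option of the
        # kind mentioned in item 4).
        for (lo, hi), stack in zip(bounds, stacks):
            cube = stack.read_and_close()
            median[lo:hi] = np.nanmedian(cube, axis=0, overwrite_input=True)
    return median


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_images = [rng.normal(size=(200, 300)).astype(np.float32) for _ in range(5)]
    print(build_section_medians(fake_images, n_sections=4).shape)  # (200, 300)
```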

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API?
    • add an entry to CHANGES.rst within the relevant release section (otherwise add the no-changelog-entry-needed label to this PR)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly

@emolter (Collaborator, Author) commented Sep 12, 2024

initial set of regression tests running here

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 95.52846% with 11 lines in your changes missing coverage. Please review.

Project coverage is 61.86%. Comparing base (e860360) to head (2d5281c).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Missing lines |
| --- | --- | --- |
| jwst/outlier_detection/utils.py | 97.20% | 5 ⚠️ |
| jwst/resample/resample.py | 90.62% | 3 ⚠️ |
| jwst/outlier_detection/coron.py | 0.00% | 1 ⚠️ |
| jwst/outlier_detection/spec.py | 80.00% | 1 ⚠️ |
| jwst/outlier_detection/tso.py | 0.00% | 1 ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8782   +/-   ##
=======================================
  Coverage   61.86%   61.86%           
=======================================
  Files         377      377           
  Lines       38911    38952   +41     
=======================================
+ Hits        24071    24097   +26     
- Misses      14840    14855   +15     

☔ View full report in Codecov by Sentry.

@emolter changed the title from "WIP: JP-3741: Faster temporary file I/O for outlier detection in on-disk mode" to "JP-3741: Faster temporary file I/O for outlier detection in on-disk mode" on Sep 13, 2024
@emolter emolter marked this pull request as ready for review September 13, 2024 15:05
@emolter emolter requested a review from a team as a code owner September 13, 2024 15:05
@emolter (Collaborator, Author) commented Sep 13, 2024

Most recent changes fixed the bug (at least, locally on my machine) uncovered by the regression tests, but I'll wait for some reviews before spamming Jenkins with another regression test run

Review thread on this diff hunk in the buffer-size computation:

        buffer_size = None
    else:
        # reasonable buffer size is the size of an input model, since that is
        # the smallest theoretical memory footprint

Collaborator:

I'm wondering where this limit came from.

I think since the drizzled models are in memory (for in_memory=True) we will have to have at least one drizzled model in memory. We'll also always have all the input models in memory. So during generation of the first drizzled model we will have some baseline memory usage + all input models + 1 drizzled model. If we split up the drizzled model and write it out to many files (what this PR does, right? I haven't finished looking at it...), we can throw out the drizzled model before generating the next one (not what this PR does, but what we might work towards). So our peak memory usage will be: baseline + n input models + 1 drizzled model. After drizzling we can go back down to baseline + n input models.

We will then have to create a median image (equal to 1 drizzled array, but not a full model since we don't need weights etc.) and then load each temporary file and fill in the median image. I think that gives a peak memory usage of: baseline + n input models + median image. I think this is less than baseline + n input models + 1 drizzled model by at least the weight array (the same size as the drizzled data).

That's a very long-winded way of saying: can't this be the size of the drizzled array and not the input array? If so, I think it would simplify the API, as the buffer size could be computed within create_median (as it's written now).
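
To make the comparison concrete, here is a purely illustrative back-of-the-envelope calculation. Every number is an assumed placeholder, not a measurement from the JIRA ticket, and the follow-up below drops the n-input-models term for the on-disk case:

```python
# Illustrative peak-memory arithmetic for the two formulas above, in MB.
# All numbers are assumptions for the sake of the comparison.
baseline = 2000                     # assumed interpreter + pipeline overhead
n_inputs, input_size = 30, 100      # assumed number and size of input models
drizzled_data = 400                 # assumed resampled data array
drizzled_model = 3 * drizzled_data  # data + weight + context arrays, roughly
median_image = drizzled_data        # a single array, no weights needed

peak_during_drizzle = baseline + n_inputs * input_size + drizzled_model
peak_during_median = baseline + n_inputs * input_size + median_image
print(peak_during_drizzle, peak_during_median)  # 6200 5400
```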

Collaborator (Author):

I agree with you that we can use the drizzled array size, so I will change that.

I'm almost certainly missing something, but why do all the input models need to be in memory? Do you mean within resample.do_drizzle or somewhere else?

@braingram (Collaborator) commented Sep 13, 2024

You're right. My Friday brain was conflating this with the in memory case. Ignore the bit about the n input models. The statements about the memory peak still hold (I think, I could be missing something and haven't tested this).

Collaborator (Author):

The default buffer size computation is now scaled by the size of the resampled images; let me know if it looks OK to you.
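
For illustration, here is a sketch of what "scaled by the size of the resampled images" could look like. The function name and exact formula are hypothetical, not the PR's API:

```python
import numpy as np

def compute_nsections(resampled_shape, n_images, dtype=np.float32, buffer_size=None):
    """Split the resampled image rows into sections such that one
    (n_images, section_nrows, ncols) stack fits within buffer_size bytes."""
    nrows, ncols = resampled_shape
    itemsize = np.dtype(dtype).itemsize
    if buffer_size is None:
        # Default the budget to the size of one resampled data array,
        # per the discussion in this thread.
        buffer_size = nrows * ncols * itemsize
    rows_per_section = max(1, buffer_size // (n_images * ncols * itemsize))
    n_sections = int(np.ceil(nrows / rows_per_section))
    return n_sections, int(rows_per_section)

# e.g. a 2048x2048 float32 resampled image with 30 inputs:
print(compute_nsections((2048, 2048), n_images=30))  # (31, 68)
```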

Review thread on this diff hunk in the temporary-file setup:

        filestem : str
            The stem of the temporary file name. The extension ".bits" will be added.
        """
        self._temp_dir = tempfile.TemporaryDirectory(dir=tempdir)

Collaborator:

It looks like this writes the tempdir to the current working directory by default - is that desired? And might it lead to more collisions in ops? The output directory may not be the same as the current directory.

Collaborator (Author):

I don't think it should lead to collisions because the temporary directory should be unique, even though yes by default it's a subdirectory within the cwd. I have no idea what default would be desired in ops, but I believe (@braingram can tell me if this is not correct) ModelLibrary makes its temporary directory in the same way.

Collaborator:

Yeah ModelLibrary uses the same pattern for temporary files:
https://github.com/spacetelescope/stpipe/blob/ecd5d162be425c24db2498b34bcbaeccec4ac557/src/stpipe/library.py#L181
Although currently those aren't used in ops (as far as I'm aware the libraries should always be on_disk=False except for the one in resample which points to the already generated models in the output directory). Finding (and configuring) a suitable tmpdir is likely worth exploring if we don't disable all tempfiles for ops.

Collaborator:

Hmm, that seems like it might be a little problematic. Why not just leave the default at None, so that the temporary directory is the system-dependent default?

Collaborator (Author):

Setting the default to None for the DiskAppendableArray in this PR makes sense to me, will do. Any change to tempdirs for ModelLibrary is beyond the scope of this PR, but let's discuss more if this is something you'd like to see done

@braingram (Collaborator) commented Sep 16, 2024

I believe there are issues with using the default /tmp on systems in ops. @jhunkeler may know more details but I believe that /tmp is not writeable for "security" reasons. What about using the same directory that is currently used on main for the temporary models?

Collaborator (Author):

I can ask team Coffee about this at our 3:00 meeting

Collaborator (Author):

@melanieclarke we talked about this with Hien, Eric, and Jesse. It sounds like there's not really any "good" place to do file I/O on their end. We confirmed, though, that right now all temporary files are written to the current working directory, including those made by the old (but currently operational) ModelContainer when in_memory=False, which is the current outlier detection default. Based on that conversation, I think leaving the current working directory as the default is likely the safest thing to do.
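
For reference, a small sketch of the standard-library behavior being weighed here (nothing jwst-specific): dir=None falls back to the platform temp location, while dir="." creates a uniquely named directory under the current working directory, which is why collisions are not expected.

```python
import os
import tempfile

# dir=None: use TMPDIR / the platform default (tempfile.gettempdir()),
# e.g. /tmp on Linux -- the location that is problematic in ops.
with tempfile.TemporaryDirectory(dir=None) as tdir:
    print(tdir)                   # e.g. /tmp/tmpa1b2c3d4

# dir=".": a uniquely named directory inside the current working directory,
# matching where the operational pipeline already writes temporary models.
with tempfile.TemporaryDirectory(dir=".") as tdir:
    print(os.path.abspath(tdir))  # e.g. <cwd>/tmpw9x8y7z6
```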

Collaborator:

@braingram Sorry I missed this yesterday. The issue is that /tmp is mounted with the noexec flag:

cat << EOF > /tmp/myscript.sh
#!/usr/bin/env bash
echo "hello world"
EOF
chmod +x /tmp/myscript.sh
/tmp/myscript.sh
# result
-bash: /tmp/myscript.sh: Permission denied

@mairanteodoro (Collaborator) left a comment

Looks good to me! Thanks, @emolter!

@melanieclarke (Collaborator) left a comment

I think everything looks good now. Thanks very much to @emolter and @braingram for hashing this out together - I agree that this is a significant improvement to the step!

@melanieclarke (Collaborator) commented

@emolter - can we re-run the full regression test set before we merge?

@zacharyburnett - are you happy with the updates to the change log?

@emolter (Collaborator, Author) commented Sep 26, 2024

started regression tests here

@nden (Collaborator) commented Sep 27, 2024

I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

@emolter (Collaborator, Author) commented Sep 27, 2024

> I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

Actually, yes, there's one bullet in outlier_detection_imaging that is now wrong about the section size. I will update that now.

@melanieclarke (Collaborator) commented

> started regression tests here

Looks good, thanks!

@emolter (Collaborator, Author) commented Sep 27, 2024

@nden how does 2d5281c look? Not sure whether so many implementation details should be in the user-facing docs, but at least the mention of a 1 MB buffer size has been removed.

@mairanteodoro (Collaborator) left a comment

Thanks for updating the docs, @emolter!

@emolter (Collaborator, Author) commented Sep 27, 2024

> I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

Going to go ahead and merge this, assuming the small change I made is good enough to at least remove incorrect information. Additional docs changes to outlier detection can be deferred to #7817
